Training action selection neural networks using look-ahead search

ABSTRACT

Methods, systems and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network. One of the methods includes receiving an observation characterizing a current state of the environment; determining a target network output for the observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the observation for use in updating the current values of the network parameters.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Patent Application No. 62/511,945, filed on May 26, 2017, the disclosure of which is hereby incorporated by reference in its entirety.

BACKGROUND

This specification relates to selecting actions to be performed by a reinforcement learning agent.

Reinforcement learning agents interact with an environment by receiving an observation that characterizes the current state of the environment, and in response, performing an action. Once the action is performed, the agent receives a reward that is dependent on the effect of the performance of the action on the environment.

Some reinforcement learning systems use neural networks to select the action to be performed by the agent in response to receiving any given observation.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

This specification describes technologies that relate to reinforcement learning.

In one innovative aspect there is described a method of training a neural network having a plurality of network parameters. The neural network is used to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result. The neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation. The method may comprise receiving a current observation characterizing a current state of the environment. The method may further comprise determining a target network output for the current observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria. The look ahead search may be guided by the neural network in accordance with current values of the network parameters. The method may further comprise selecting an action to be performed by the agent in response to the current observation using the target network output generated by performing the look ahead search. The method may further comprise storing, in an exploration history data store, the target network output in association with the current observation for use in updating the current values of the network parameters.

Advantages of such an approach are described later, but these can include the ability to learn effectively in very large/complex state spaces and/or where there is a very sparse reward signal. In concrete terms this translates to a reinforcement learning system which can achieve substantially improved performance on a learned task whilst at the same time substantially reducing the amount of processing power and memory needed for training. This reduced processing power can translate, in implementations, into a significantly reduced electrical power consumption, for example by reducing the amount of specialist hardware needed to perform the training in a practical time frame. It can also facilitate implementing a high-performance reinforcement learning system on a physically smaller computing device. Similar advantages can be achieved in implementations of a correspondingly trained reinforcement learning system, described later.

In implementations the look ahead search may be a search through a state tree having nodes representing states of the environment, for example starting from a root node that represents the current state. As described later, the data defining the tree may be organized in any convenient manner. The search may continue until a terminal, e.g., leaf node, state of the search is reached representing a (possible) future state of the environment. In general this is different to a terminal state of an episode of interactions, which may be defined by performance (or failure of performance) of the task or otherwise as described later.

In some implementations the neural network, or another network, provides a predicted expected return output, i.e., an estimate of a return resulting from the environment being in the state. In broad terms this may be considered as a state-based value function. The method may then comprise determining a target return based on evaluating progress of the task as determined at the terminal state of the current episode of interaction, for example based on the end result achieved. This may be used for updating the neural network that generates the target network output.

Performing the look ahead search may comprise traversing the state tree until a leaf node is reached. This may comprise selecting one of multiple edges connecting to a first node, based on an action score for the edge, to identify the next node in the tree. The action score relates to an action which, when performed, moves from a (possible) state of the environment represented by the first node to a (possible) state represented by the next node. The action score may optionally be adjusted by an amount dependent upon a prior probability for the action, which may be provided by the network output of the action selection neural network. The adjustment may be reduced according to a count of how many times the respective edge has been traversed, to encourage exploration. Optionally noise may be added to the prior probabilities for a node, in particular the root node for a look ahead search. Leaf nodes may be evaluated in accordance with current values of the network parameters, more particularly generating the prior probabilities for the outgoing edges of the leaf node using a predicted probability distribution over the actions. The action scores may be determined by initializing an action score then updating the score using the results of one or more look ahead searches that traverse the corresponding edge.

In some implementations the method may further comprise obtaining, from the exploration history store, a training observation and a training target network output associated with the training observation. The training observation may be processed using the neural network to generate a training network output. The method may then determine a gradient with respect to the network parameters of an objective function that encourages the training network output to match the training target network output, which may then be used to update the network parameters.

In broad terms, the look ahead search determines a target network output, which may then be stored to provide the training target network output later. The training target network output is used to improve the neural network, which is itself used for determining the target network output in a next iteration. The (training) target network output from the look ahead search may comprise a vector of action scores or probabilities (π); these may be proportional to the visit count N of each action from a root node of the search, or to N^(1/τ) where τ is a temperature.

In some implementations the network output may comprise both policy data, such as a vector of action scores or probabilities given a state, and state value data, such as the predicted expected return of a state, and both of these may be updated. Thus the action selection output may define a probability distribution over possible actions to be performed by the agent. However, in general any reinforcement learning technique may be employed. Thus in some other implementations the action selection output may comprise a respective Q value for each of a plurality of possible actions, where the Q value represents an expected return to be received if the agent performs the possible action in response to the observation. Alternatively the action selection output may directly identify or define an optimal action to be performed by the agent in response to the observation.

An objective function that encourages the training network output to match the training target network output may comprise, for example, any suitable loss function which is dependent upon a measure of a difference (or similarity) between the training network output and the training target network output.

By way of example a suitable objective function may comprise a term dependent upon a difference between the probability distribution (π) in the training target network output and the probability distribution (p) in the training network output, for example π^(T) log p. Additionally or alternatively a suitable objective function may comprise a term dependent upon a difference between the predicted expected return output in the training target network output and the predicted expected return output in the training network output, for example a mean squared error between these terms.

There is also described a trained neural network system comprising a trained neural network having a plurality of trained network parameters. The neural network system is configured to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result. The trained neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the trained network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation. The neural network system may comprise an input to receive a current observation characterizing a current state of the environment. The neural network system may further comprise an output for selecting an action to be performed by the agent in response to the current observation according to the action selection output. The neural network system may be configured to provide the output for selecting the action by performing a look ahead search, wherein the look ahead search comprises a search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria. The look ahead search may be guided by the trained neural network in accordance with values of the network parameters, in particular such that the search is dependent upon the action selection output from the trained neural network.

There is further provided a method of training a controller neural network, wherein the controller neural network has a state vector input to receive state data from a subject system having a plurality of states; an action probability vector data output to output an action probability vector defining a set of probabilities for implementing each of a corresponding set of actions, wherein an action performed on the subject system moves the system from a state defined by the state data to another of the states; a baseline value data output to output a baseline value dependent upon a baseline likelihood of the subject system providing a reward when in a state defined by the state data; and a plurality of neural network layers between the state vector input and the action probability vector data output and the value data output, wherein the layers are connected by a plurality of weights; wherein the method comprises: implementing a succession of actions on the subject system to move the system through a succession of the states, wherein actions of the succession of actions are selected according to a guided look ahead search guided by the controller neural network; generating and storing probability data for actions and states selected according to the guided look ahead search; updating the controller neural network using the stored data; and implementing a further succession of actions on the subject system according to a guided look ahead search guided by the updated controller neural network.

There is also provided a data processor comprising a neural network including: a state vector input to receive state data from a subject system having a plurality of states; an action probability vector data output to output an action probability vector defining a set of probabilities for implementing each of a corresponding set of actions, wherein an action performed on the subject system moves the system from a state defined by the state data to another of the states; a baseline value data output to output a baseline value dependent upon a baseline likelihood of the subject system providing a reward when in a state defined by the state data; and a plurality of neural network layers between the state vector input and the action probability vector data output and the value data output, wherein the layers are connected by a plurality of weights; and a training module configured to: implement a succession of actions on the subject system to move the system through a succession of the states, wherein actions of the succession of actions are selected according to a guided look ahead search guided by the neural network; generate guided look ahead search probabilities for actions and states selected according to the guided look ahead search; store in memory state exploration history data comprising, for states of the succession of states, state data defining the states, guided look ahead search probability data for the states, and reward data defining an expected or actual reward associated with the state; train the neural network using the state exploration history data to update the weights of the neural network; and implement a further succession of actions on the subject system according to a guided look ahead search guided by the neural network with the updated weights, to generate further state exploration history data for training the neural network.

There is also provided a trained electronic controller comprising: a state vector input to receive state data from a subject system having a plurality of states; an action probability vector data output to output an action probability vector defining a set of probabilities for implementing each of a corresponding set of actions, wherein an action performed on the subject system moves the system from a state defined by the state data to another of the states; a baseline value data output to output a baseline value dependent upon a baseline likelihood of the subject system providing a reward when in a state defined by the state data; and a plurality of neural network layers between the state vector input and the action probability vector data output and the value data output, wherein the layers are connected by a plurality of weights; and a control system configured to: implement a succession of actions on the subject system to move the system through a succession of the states, wherein actions of the succession of actions are selected according to a guided look ahead search guided by the neural network, wherein the guided look ahead search comprises a search through potential future actions and states guided by the neural network such that the search through potential future actions and states is dependent upon one or both of the action probability vector and the baseline value from the neural network from successive potential future states defined by the potential future actions.

The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages. Actions to be performed by an agent interacting with an environment to perform a task that has a very large state space can be effectively selected. In other words, the actions can be effectively selected to maximize the likelihood that a desired result, such as performance of a learned task, will be achieved. In particular, actions can effectively be selected even when the environment has a state tree that is too large to be exhaustively searched. By using a neural network to guide a look ahead search during learning, the effectiveness of the training process can be increased and the neural network can be trained to have a high level of performance on the task over fewer training iterations and using fewer computational resources. In implementations, by using the same neural network to predict both the action selection policy and the value of the current state, i.e., the predicted return, the amount of computational resources consumed by the neural network to effectively select an action can be reduced. Additionally, by employing the described guided look ahead search during learning, the neural network can be trained to achieve a high level of performance on the task with no external supervision (other than a very sparse reward signal) or any human or other expert data.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example reinforcement learning system.

FIG. 2 is a diagram of generating an experience history and using the experience history to update the values of the parameters of the neural network.

FIG. 3 is a flow diagram of an example process for generating an experience history.

FIG. 4 is a flow diagram of an example process for performing a look ahead search.

FIG. 5 is a flow diagram of an example process for determining an update to the current network parameter values.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification generally describes a reinforcement learning system that selects actions to be performed by a reinforcement learning agent interacting with an environment. In order to interact with the environment, the reinforcement learning system receives data characterizing the current state of the environment and selects an action from a set of actions to be performed by the agent in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment.

Generally, the agent interacts with the environment in order to perform a particular task, i.e., achieve a specified result, and the reinforcement learning system selects actions in order to maximize the likelihood of completing the task, i.e., of achieving the result.

In some implementations, the environment is a real-world environment and the agent is a control system for a mechanical agent interacting with the real-world environment. For example, the agent may be a control system integrated in an autonomous or semi-autonomous vehicle navigating through the environment. In these implementations, the actions may be possible control inputs to control the vehicle and the result that the agent is attempting to achieve is to satisfy objectives for the navigation of the vehicle through the real-world environment. For example, the objectives can include one or more of: reaching a destination, ensuring the safety of any occupants of the vehicle, minimizing energy used in reaching the destination, maximizing the comfort of the occupants, and so on.

As another example, the agent may be a robot or other mechanical agent interacting with the environment to achieve a specific task, e.g., to locate an object of interest in the environment or to move an object of interest to a specified location in the environment. In these implementations, the actions may be possible control inputs to control the robot.

In some other implementations the real-world environment may be a manufacturing plant or service facility, the observations may relate to operation of the plant or facility, for example to resource usage such as power consumption, and the agent may control actions or operations in the plant/facility, for example to reduce resource usage.

Thus in general terms, in implementations the agent may be a mechanical or electronic agent and the actions may comprise control inputs to control the mechanical or electronic agent. The observations may be derived from sensors, for example image sensors, and/or they may be derived from electrical or mechanical signals from the agent.

In some further implementations, the environment is a real-world environment and the agent is a computer system that generates outputs for presentation to a user.

For example, the environment may be a patient diagnosis environment such that each state is a respective patient state of a patient, i.e., as reflected by health data characterizing the health of the patient, and the agent may be a computer system for suggesting treatment for the patient. In this example, the actions in the set of actions are possible medical treatments for the patient and the result to be achieved can include one or more of maintaining a current health of the patient, improving the current health of the patient, minimizing medical expenses for the patient, and so on. The observations may comprise data from one or more sensors, such as image sensors or biomarker sensors, and/or may comprise processed text, for example from a medical record.

As another example, the environment may be a protein folding environment such that each state is a respective state of a protein chain and the agent is a computer system for determining how to fold the protein chain. In this example, the actions are possible folding actions for folding the protein chain and the result to be achieved may include, e.g., folding the protein so that the protein is stable and so that it achieves a particular biological function. As another example, the agent may be a mechanical agent that performs or controls the protein folding actions selected by the system automatically without human interaction. The observations may comprise direct or indirect observations of a state of the protein and/or may be derived from simulation.

In some other implementations, the environment is a simulated environment and the agent is implemented as one or more computer programs interacting with the simulated environment. For example, the simulated environment may be a virtual environment in which a user competes against a computerized agent to accomplish a goal and the agent is the computerized agent. In this example, the actions in the set of actions are possible actions that can be performed by the computerized agent and the result to be achieved may be, e.g., to win the competition against the user.

Generally, the system uses a neural network to select actions to be performed by the agent interacting with the environment. The neural network has a set of network parameters and is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that includes an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation.

FIG. 1 shows an example reinforcement learning system 100. The reinforcement learning system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The reinforcement learning system 100 selects actions to be performed by a reinforcement learning agent 102 interacting with an environment 104. That is, the reinforcement learning system 100 receives observations, with each observation being data characterizing a respective state of the environment 104, and, in response to each received observation, selects an action from a set of actions to be performed by the reinforcement learning agent 102 in response to the observation.

Once the reinforcement learning system 100 selects an action to be performed by the agent 102, the reinforcement learning system 100 instructs the agent 102 and the agent 102 performs the selected action. Generally, the agent 102 performing the selected action results in the environment 104 transitioning into a different state.

The observations characterize the state of the environment in a manner that is appropriate for the context of use for the reinforcement learning system 100.

For example, when the agent 102 is a control system for a mechanical agent interacting with the real-world environment, the observations may be images captured by sensors of the mechanical agent as it interacts with the real-world environment and, optionally, other sensor data captured by the sensors of the agent.

As another example, when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient.

As another example, when the environment 104 is a protein folding environment, the observations may be images of the current configuration of a protein chain, a vector characterizing the composition of the protein chain, or both.

In particular, the reinforcement learning system 100 selects actions using an action selection neural network 130.

Generally, the action selection neural network 130 is a neural network that is configured to receive a network input including an observation and to process the network input in accordance with parameters of the action selection neural network ("network parameters") to generate a network output. The network output includes an action selection output and, in some cases, a predicted expected return output. Typically the predicted expected return is a second output from another "head" of the action selection neural network but it may be generated by a separate second neural network which may be jointly trained with the action selection neural network.
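
By way of illustration only, the following is a minimal sketch of such a two-headed network; the layer sizes, the tanh nonlinearity, and the softmax policy head are assumptions made for the sketch rather than details taken from this specification.

    import numpy as np

    class ActionSelectionNetwork:
        # A minimal two-headed sketch: a shared hidden layer feeding a
        # policy head and a value head. All shapes here are illustrative.
        def __init__(self, obs_dim, num_actions, hidden_dim=128, seed=0):
            rng = np.random.default_rng(seed)
            self.w_shared = rng.normal(0.0, 0.1, (obs_dim, hidden_dim))
            self.w_policy = rng.normal(0.0, 0.1, (hidden_dim, num_actions))
            self.w_value = rng.normal(0.0, 0.1, (hidden_dim, 1))

        def __call__(self, observation):
            h = np.tanh(observation @ self.w_shared)  # shared torso
            logits = h @ self.w_policy
            probs = np.exp(logits - logits.max())     # softmax policy head
            probs = probs / probs.sum()
            value = np.tanh(h @ self.w_value).item()  # predicted return V(s)
            return probs, value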

The action selection output defines an action selection policy for selecting an action to be performed by the agent in response to the input observation.

In some cases, the action selection output defines a probability distribution over possible actions to be performed by the agent. For example, the action selection output can include a respective action probability for each action in a set of possible actions that can be performed by the agent to interact with the environment. In another example, the action selection output can include parameters of a distribution over the set of possible actions.

In some other cases, the action selection output includes a respective Q value for each of a plurality of possible actions. A Q value for a possible action represents an expected return to be received if the agent performs the possible action in response to the observation.

In some cases, the action selection output identifies an optimal action from the set of possible actions to be performed by the agent in response to the observation. For example, in the case of controlling a mechanical agent, the action selection output can identify torques to be applied to one or more joints of the mechanical agent.

The predicted expected return output for a given observation is an estimate of a return resulting from the environment being in the state characterized by the observation, with the return typically being a numeric reward or a combination, e.g., a time-discounted sum, of numeric rewards received as a result of the agent interacting with the environment. The predicted expected return may be designated by a scalar value V(s). Generally, the rewards reflect the progress of the agent toward accomplishing the specified result. In many cases, the rewards will be sparse, with the only reward being received at a terminal state of any given episode of interactions and indicating whether the specified result was achieved or not.

The remainder of this specification will describe cases where the action selection output is or defines a probability distribution over the possible actions to be performed by the agent. However, one of ordinary skill in the art will appreciate that these cases can readily be adapted to train and use neural networks that generate one of the other kinds of action selection output referenced above.

To allow the agent 102 to effectively interact with the environment 104, the reinforcement learning system 100 trains the neural network 130 to determine trained values of the network parameters.

In particular, the training includes two parts: an action selection part and a parameter updating part.

During the action selection part, an action selection subsystem 120 receives observations and selects actions to be performed by the agent in response to the observations to interact with the environment by performing a look ahead search using the neural network 130. In particular, for a given episode of interactions, i.e., a set of interactions that starts at an initial state and ends at a terminal state for the episode, the action selection subsystem 120 selects actions based on the results of the look ahead search instead of directly using the action selection outputs generated by the neural network 130. Based on the results of the interactions, the action selection subsystem 120 generates exploration histories and stores the exploration histories in an exploration history data store 140. Performing the look ahead search and generating exploration histories is described below with reference to FIGS. 2-4.

During the parameter updating part, a parameter updating subsystem 110 obtains exploration histories from the exploration history data store 140 and uses the histories to update the values of the parameters of the neural network 130. Updating the values of the parameters of the neural network 130 is described below with reference to FIG. 5.

During the training, the action selection subsystem 120 and the parameter updating subsystem 110 repeatedly perform their corresponding parts of the training process asynchronously and in parallel. In some cases, one or both of these parts of the process are distributed, so that many different instances of the neural network 130 are trained in parallel, with the computation spread out across multiple computing units, e.g., multiple computers or multiple cores within a single computer.

In some implementations, the action selection subsystem 120 performs the look ahead search using a simulated version of the environment 104.

Generally, the simulated version of the environment 104 is a virtualized environment that simulates how actions performed by the agent 102 would affect the state of the environment 104.

For example, when the environment 104 is a real-world environment and the agent is an autonomous or semi-autonomous vehicle, the simulated version of the environment is a motion simulation environment that simulates navigation through the real-world environment. That is, the motion simulation environment simulates the effects of various control inputs on the navigation of the vehicle through the real-world environment. More generally, when the environment 104 is a real-world environment and the agent is a mechanical agent, the simulated version of the environment is a dynamics model that models how actions performed by the agent change the state of the environment 104.

As another example, when the environment 104 is a patient diagnosis environment, the simulated version of the environment is a patient health simulation that simulates effects of medical treatments on patients. For example, the patient health simulation may be a computer program that receives patient information and a treatment to be applied to the patient and outputs the effect of the treatment on the patient's health.

As another example, when the environment 104 is a protein folding environment, the simulated version of the environment is a simulated protein folding environment that simulates effects of folding actions on protein chains. That is, the simulated protein folding environment may be a computer program that maintains a virtual representation of a protein chain and models how performing various folding actions will influence the protein chain.

As another example, when the environment 104 is the virtual environment described above, the simulated version of the environment is a simulation in which the user is replaced by another computerized agent.

In some implementations, the action selection subsystem 120 performs the look ahead search by performing a tree search guided by the outputs of the neural network 130. In particular, in these implementations, the action selection subsystem 120 maintains data representing a state tree of the environment 104. The state tree includes nodes that represent states of the environment 104 and directed edges that connect nodes in the tree. An outgoing edge from a first node to a second node in the tree represents an action that was performed in response to an observation characterizing the first state and resulted in the environment transitioning into the second state.

While the data is logically described as a tree, it can be represented by the action selection subsystem 120 using any of a variety of convenient physical data structures, e.g., as multiple triples or as an adjacency list.

The action selection subsystem 120 also maintains edge data for each edge in the state tree that includes (i) an action score for the action represented by the edge, (ii) a visit count for the action represented by the edge, and (iii) a prior probability for the action represented by the edge.

At any given time, the action score for an action represents the current estimate of the return that will be received if the action is performed, the visit count for the action is the current number of times that the action has been performed by the agent 102 in response to observations characterizing the respective first state represented by the respective first node for the edge, and the prior probability represents the likelihood that the action is the action that should be performed by the agent 102 in response to observations characterizing the respective first state.
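
A minimal sketch of data structures that could hold this node and edge data follows; the field names, the running return total, and the child pointer are illustrative assumptions, and the later sketches in this description reuse these classes.

    from dataclasses import dataclass, field

    @dataclass
    class Edge:
        prior_p: float             # (iii) prior probability P(s, a)
        visit_count: int = 0       # (ii) visit count N(s, a)
        total_return: float = 0.0  # running sum of backed-up returns
        action_score: float = 0.0  # (i) action score Q(s, a), a mean return
        child: "Node | None" = None

    @dataclass
    class Node:
        edges: dict = field(default_factory=dict)  # maps action -> Edge

        def is_leaf(self) -> bool:
            # A leaf node has no outgoing edges yet.
            return not self.edges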

The action selection subsystem 120 updates the data representing the state tree and the edge data for the edges in the state tree from look ahead searches performed using the neural network 130 during training. Performing the look ahead search is described in more detail below with reference to FIGS. 2-4.

In some implementations, the system 100 uses the action selection outputs generated by the neural network 130 directly in selecting actions in response to observations after the neural network 130 has been trained. In other implementations, the system 100 continues performing look ahead searches using the neural network 130 and selecting actions to be performed by the agent using the results of those look ahead searches even after the neural network has been trained.

FIG. 2 is a diagram 200 of generating an experience history and using the experience history to update the values of the parameters of the neural network 130.

As shown in FIG. 2, during one episode of performing a task, the system selects actions to be performed by the agent in response to observations s₁ through s_(T), i.e., starting from when the environment is in an initial state s₁ and until the environment reaches a terminal state s_(T). The terminal state of an episode may be a state in which the specified result has been achieved, a state that the environment is in after a specified number of actions have been performed after the environment was in the initial state without the specified result having been achieved, or a state in which the system determines that the result is not likely to be achieved.

In particular, in response to each of the observations s the system generates a target action selection output π that defines a probability distribution over the set of actions and then samples an action a to be performed by the agent from the probability distribution defined by the target action selection output π. For example, the system samples an action a₂ to be performed by the agent in response to the observation s₂ from the target action selection output π₂.

To generate a target action selection output in response to a given observation, the system uses the neural network 130 to perform a look ahead search, i.e., a look ahead search that is guided by the neural network in accordance with current values of the network parameters. Thus, the system performs the look ahead search using the neural network to select the actions for the agent instead of using the neural network to directly select the action.

In cases where the network output also includes a predicted return, the system also obtains a target return z after the terminal observation T of the episode. The target return can be based on whether the terminal state of the environment achieved the specified result, e.g., 1 if the terminal state achieves the result and 0 or −1 if the terminal state does not achieve the result.

For each of the observations encountered during the instance of performing the task, the system also generates an experience history that includes the observation, the target output generated in response to the observation through the look ahead search, and the target return z. For example, the system generates an experience history that includes the observation s₂, the target output π₂, and the target return z.

The system can then (asynchronously from generating the experience histories) use the experience histories to update the values of the parameters of the neural network 130. In particular, for an experience history that includes an observation s, a target output π and target return z, the system processes the observation s using the neural network 130 to generate a network output that includes an action selection output p and a predicted return v. The system then trains the neural network 130 by adjusting the values of the network parameters so that the action selection output p more closely matches the target output π and the predicted return v more closely matches the target return z.

FIG. 3 is a flow diagram of an example process 300 for generating an experience history. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a reinforcement learning system, e.g., the reinforcement learning system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 300.

The system receives a current observation (step 302).

The system generates a target action selection output for the observation by performing a look ahead search using the neural network and in accordance with current values of the network parameters (step 304). In particular, the system performs the look ahead search to traverse possible future states of the environment starting from the current state characterized by the current observation. The system continues the look ahead search until a possible future state that satisfies termination criteria is encountered. For example, the look ahead search may be a tree search and the criteria may be that the future state is represented by a leaf node in the state tree. Performing the look ahead search will be described in more detail below with reference to FIG. 4.

The system selects an action to be performed by the agent in response to the current observation using the target action selection output (step 306), e.g., by sampling from the probability distribution defined by the target action selection output.
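
For example, a minimal sampling sketch, assuming the target output is a probability vector aligned with a list of actions, might look as follows.

    import numpy as np

    def sample_action(actions, target_pi, rng=None):
        # Sample from the distribution defined by the target action
        # selection output (step 306).
        rng = rng or np.random.default_rng()
        return actions[rng.choice(len(actions), p=target_pi)]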

The system obtains a target return (step 308). The target return is the return obtained by the system starting from after the action was performed and ending when the environment reaches the terminal state of the current episode. As described above, the target return reflects the progress of the agent towards achieving the specified result starting from being in the state characterized by the current observation. In many cases, the rewards are sparse and are only received at the terminal state.

The system generates an experience history that includes the current observation and a target network output that includes the target action selection output and the target return (step 310) and then stores the generated experience history in the experience history data store.
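
A sketch of the stored record, assuming a simple list-backed data store; the tuple fields mirror the observation, target action selection output, and target return described above.

    from typing import Any, NamedTuple

    class ExperienceHistory(NamedTuple):
        observation: Any      # the current observation s
        target_output: Any    # the target action selection output pi
        target_return: float  # the target return z

    def store_history(data_store, observation, target_output, target_return):
        # `data_store` stands in for the experience history data store.
        data_store.append(ExperienceHistory(observation, target_output, target_return))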

The system can repeat the process 300 for each observation received during an episode of interaction of the agent with the environment.

FIG. 4 is a flow diagram of an example process 400 for performing a look ahead search of an environment using a neural network. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a reinforcement learning system, e.g., the reinforcement learning system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 400.

The system receives data identifying a root node for the search, i.e., a node representing the state characterized by the current observation (step 402).

The system traverses the state tree until the search reaches a leaf state, i.e., a state that is represented by a leaf node in the state tree (step 404).

That is, at each in-tree node, i.e., a node encountered starting from the root node until reaching the leaf state, the system selects the edge to be traversed using the edge data for the outgoing edges from the in-tree node representing the in-tree state. The system may select the edge based on the action score or may determine an adjusted action score for selecting an edge.

In particular, for each outgoing edge from an in-tree node, the system may determine an adjusted action score for the edge based on the action score (Q(s,a)) for the edge, the visit count (N) for the edge, and the prior probability (P(s,a)) for the edge (described further later). Generally, the system computes the adjusted action score for a given edge by adding to the action score for the edge a bonus that is proportional to the prior probability for the edge but decays with repeated visits to encourage exploration. For example, the bonus may be directly proportional to the product of the prior probability and a ratio that has the square root of the sum of all visit counts for all outgoing edges from the root node as the numerator and a constant, e.g., one, plus the visit count for the edge representing the action as the denominator. For example the bonus may be dependent upon P(s,a)/(1+N).

The system then selects the edge with the highest adjusted action score as the edge to be traversed from the in-tree node.
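
A sketch of this selection rule, reusing the Edge and Node classes sketched earlier; the exploration constant c_puct is an assumed parameter, and, following the common PUCT formulation, the visit-count sum here is taken over the outgoing edges of the node being traversed.

    import math

    def select_edge(node, c_puct=1.0):
        # Total visit count over the outgoing edges of this node.
        total_visits = sum(e.visit_count for e in node.edges.values())

        def adjusted_score(edge):
            # Bonus proportional to P(s, a), decaying with repeated visits.
            bonus = c_puct * edge.prior_p * math.sqrt(total_visits) / (1 + edge.visit_count)
            return edge.action_score + bonus

        # Traverse the edge with the highest adjusted action score.
        action = max(node.edges, key=lambda a: adjusted_score(node.edges[a]))
        return action, node.edges[action]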

In some cases, to further drive exploration of the state space, the system adds noise to the prior probabilities for the root node before selecting an action for the root node. For example, the system may interpolate between the actual prior probability for a given action and noise sampled from a Dirichlet process to generate the final prior probability that is used when selecting the action to be performed when at the root node.
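
One way to realize this interpolation is sketched below; the mixing weight epsilon and the Dirichlet concentration alpha are illustrative values, not parameters specified by this description.

    import numpy as np

    def add_root_noise(root, epsilon=0.25, alpha=0.3):
        # Mix each root prior with a sample of Dirichlet noise.
        noise = np.random.dirichlet([alpha] * len(root.edges))
        for eta, edge in zip(noise, root.edges.values()):
            edge.prior_p = (1 - epsilon) * edge.prior_p + epsilon * eta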

The system continues traversing the state tree in this manner until a leaf node in the state tree is reached. Generally, a leaf node is a node in the state tree that has no child nodes, i.e., is not connected to any other nodes by an outgoing edge.

The system then expands the leaf node (step 406).

To expand the leaf node, the system may add a respective new edge to the state tree for each action that is a valid action to be performed by the agent in response to a leaf observation characterizing the state represented by the leaf node. The system also initializes the edge data for each new edge by setting the visit count and action scores for the new edge to zero.

The system evaluates the leaf node using the action selection neural network in accordance with the current values of the parameters to generate a respective prior probability for each new edge (step 408). To determine the prior probability for each new edge, the system may process the leaf observation using the action selection neural network and use the action probabilities from the distribution defined by the network output as the prior probabilities for the corresponding edges. The system may also generate a predicted return for the leaf observation from the results of the processing of the leaf observation by the neural network.
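
Continuing the earlier sketches, expansion and evaluation might be combined as follows, where `network` stands in for the action selection neural network 130 and is assumed to return a probability per action together with a predicted return.

    def expand_and_evaluate(leaf, leaf_observation, network, valid_actions):
        # Evaluate the leaf observation to obtain action probabilities
        # (the new priors) and a predicted return (step 408).
        probs, predicted_return = network(leaf_observation)
        for i, action in enumerate(valid_actions):
            # New edges start with a visit count and action score of zero (step 406).
            leaf.edges[action] = Edge(prior_p=float(probs[i]))
        return predicted_return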

The system then updates the edge data for the edges traversed during the search based on the predicted return for the leaf node (step 410).

In particular, for each edge that was traversed during the search, the system increments the visit count for the edge by a predetermined constant value, e.g., by one. The system also updates the action score for the edge using the predicted expected return for the leaf node by setting the action score equal to the new average of the predicted expected returns of all searches that involved traversing the edge.
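
A sketch of this backup step, keeping the action score equal to the running mean of the predicted returns backed up through each edge:

    def backup(traversed_edges, predicted_return):
        for edge in traversed_edges:
            edge.visit_count += 1                  # increment N (step 410)
            edge.total_return += predicted_return
            # Q becomes the mean return over all searches through this edge.
            edge.action_score = edge.total_return / edge.visit_count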

The system determines the target action selection output for the current observation using the results of the look ahead search (step 412). In particular, the system determines the target network output using the visit counts for the outgoing edges from the root node after the edge data has been updated based on the results of the look ahead search. For example, the system can apply a softmax over the visit counts for the outgoing edges from the root node to determine the probabilities in the target network output. In some implementations, the softmax has a reduced temperature to encourage exploration of the state space. In some implementations, the softmax temperature is only reduced after a threshold number of look ahead searches have been performed within an episode to ensure that a diverse set of states are encountered during various episodes.
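
A sketch of deriving the target probabilities from the root visit counts, using the exponentiated-count form N^(1/τ) mentioned in the summary; note that a temperature below one sharpens the distribution toward the most-visited action, and the temperature schedule is left as an implementation choice.

    import numpy as np

    def target_policy(root, temperature=1.0):
        counts = np.array([e.visit_count for e in root.edges.values()], dtype=float)
        scaled = counts ** (1.0 / temperature)  # N^(1/tau) per outgoing edge
        return scaled / scaled.sum()            # normalize to probabilities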

As described above, in some cases the system uses a simulated version of the environment to perform the look ahead search, e.g., to identify which state taking an action leads to when in a leaf state, to identify which state taking an action leads to when the outgoing edge for the action is not connected to any node in the tree, or to verify that the edge data for an in-tree node accurately reflects the transitions that will occur when a given action is selected.

In some implementations, the system distributes the searching of the state tree, i.e., by running multiple different searches in parallel on multiple different machines, i.e., computing devices, or in multiple threads on one or more such machines.

For example, the system may implement an architecture that includes a master machine that executes the main search. The entire state tree may be stored on the master, which only executes the in-tree phase of each simulation. The leaf positions are communicated to one or more workers, which execute the expansion and evaluation phase of the simulation.

In some cases, the system does not update the edge data until a predetermined number of look ahead searches have been performed since a most-recent update of the edge data, e.g., to improve the stability of the search process in cases where multiple different searches are being performed in parallel.

FIG. 5 is a flow diagram of an example process 500 for determining an update to current values of the network parameters. For convenience, the process 500 will be described as being performed by a system of one or more computers located in one or more locations. For example, a reinforcement learning system, e.g., the reinforcement learning system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 500.

The system obtains an experience history from the experience history data store (step 510). For example, the experience history can be one of a batch of experience histories sampled from the experience history data store.

The system processes the observation in the experience history using the neural network and in accordance with the current values of the network parameters to generate a training network output (step 520).

The system determines a gradient with respect to the network parameters of an objective function that encourages the training network output to match the training target network output in the experience history (step 530).

When the training network output and the experience history include only an action selection output, the objective function can measure a difference between the probability distribution in the training target network output and the probability distribution in the training network output. For example, the objective function can be a cross entropy loss function.

When the training network output and the experience history include both an action selection output and a return, the objective function can be a weighted sum between (i) a difference between the probability distribution in the training target network output and the probability distribution in the training network output, e.g., a cross entropy loss between the two distributions, and (ii) a difference between the predicted expected return output in the training target network output and the predicted expected return output in the training network output, e.g., a mean squared difference. In either case, the objective function can also include one or more regularization terms, e.g., an L2 regularization term, to prevent overfitting.
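
A sketch of such a combined objective; the value weight, the L2 coefficient, and the small constant guarding the logarithm are assumptions of the sketch rather than values given by this description.

    import numpy as np

    def training_loss(target_pi, p, target_z, v, weights, value_weight=1.0, c_l2=1e-4):
        # (i) cross entropy between target and training policy distributions
        policy_loss = -np.sum(target_pi * np.log(p + 1e-8))
        # (ii) mean squared difference between target and predicted returns
        value_loss = (target_z - v) ** 2
        # L2 regularization over the network weights to prevent overfitting
        l2_term = c_l2 * sum(np.sum(w ** 2) for w in weights)
        return policy_loss + value_weight * value_loss + l2_term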

Thus, optimizing the objective function encourages the neural network to generate action selection outputs that match the target action selection output and, when used, to generate predicted returns that match the target predicted returns.

The system determines an update to the current values of the network parameters from the gradient (step 540), i.e., uses the gradient as the update to the network parameters.

The system can perform the process 500 for each experience history in a batch of experience histories to determine a respective update for each of the experience histories. The system can then apply the updates to the current values of the network parameters to determine updated values of the parameters, i.e., in accordance with an update rule that is specified by the optimizer used by the system to train the neural network. For example, when the optimizer is a stochastic gradient descent optimizer, the system can sum all of the updates, multiply the sum by a learning rate, and then add the result to the current values of the network parameters.
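
For the stochastic gradient descent case just described, the batched update might be applied as sketched below, where each per-history update is assumed to already point in the direction that improves the objective.

    def apply_sgd_updates(params, updates, learning_rate=0.01):
        # Sum the per-history updates elementwise, scale by the learning
        # rate, and add the result to the current parameter values.
        summed = [sum(us) for us in zip(*updates)]
        return [w + learning_rate * u for w, u in zip(params, summed)]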

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be or further include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Computers suitable for the execution of a computer program can be based on, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

What is claimed is:
 1. A method of training a neural network having a plurality of network parameters, wherein the neural network is used to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the method comprises: receiving a current observation characterizing a current state of the environment; determining a target network output for the current observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the current observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the current observation for use in updating the current values of the network parameters.
 2. The method of claim 1, wherein the action selection output defines a probability distribution over possible actions to be performed by the agent.
 3. The method of claim 1, wherein the action selection output comprises a respective Q value for each of a plurality of possible actions that represents an expected return to be received if the agent performs the possible action in response to the observation.
 4. The method of claim 1, wherein the action selection output identifies an optimal action to be performed by the agent in response to the observation.
 5. The method of claim 1, wherein the network output further comprises a predicted expected return output that is an estimate of a return resulting from the environment being in the state, and wherein determining the target network output comprises: determining a target return based on evaluating a progress of the task as of a terminal state of a current episode of interaction.
 6. The method of claim 5, wherein the return is dependent on whether the specified result is achieved as of the terminal state.
 7. The method of claim 1, wherein the look ahead search is a tree search of a state tree having nodes representing states of the environment starting from a root node that represents the current state.
 8. The method of claim 7, wherein performing the look ahead search comprises adding noise to prior probabilities for the root node that are used to traverse from the root node to other nodes in the state tree.
 9. The method of claim 7, wherein performing the look ahead search comprises evaluating leaf nodes of the state tree encountered during the look ahead search using the neural network and in accordance with current values of the network parameters.
 10. The method of claim 1, further comprising: obtaining, from the exploration history store, a training observation and a training target network output associated with the training observation; processing the training observation using the neural network and in accordance with the current values of the network parameters to generate a training network output; determining a gradient with respect to the network parameters of an objective function that encourages the training network output to match the training target network output; and determining an update to the current values of the network parameters from the gradient.
 11. The method of claim 10, wherein the network output includes an action selection output that defines a probability distribution over possible actions to be performed by the agent and a predicted expected return output that is an estimate of a return resulting from the environment being in the state, and wherein the objective function is a weighted sum between (i) a difference between the probability distribution in the training target network output and the probability distribution in the training network output and (ii) a difference between the predicted expected return output in the training target network output and the predicted expected return output in the training network output.
 12. A trained neural network system comprising a trained neural network having a plurality of trained network parameters and that is implemented by one or more computers, wherein the neural network system is configured to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the trained neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the trained network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the neural network system comprises: an input to receive a current observation characterizing a current state of the environment; and an output for selecting an action to be performed by the agent in response to the current observation according to the action selection output; and wherein the neural network system is configured to provide the output for selecting the action by performing a look ahead search, wherein the look ahead search comprises a search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, and wherein the look ahead search is guided by the trained neural network in accordance with values of the network parameters.
 13. A trained neural network system as claimed in claim 12, wherein the look ahead search is guided such that the search is dependent upon the action selection output from the trained neural network.
 14. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for training a neural network having a plurality of network parameters, wherein the neural network is used to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the operations comprise: receiving a current observation characterizing a current state of the environment; determining a target network output for the current observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the current observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the current observation for use in updating the current values of the network parameters.
 15. One or more non-transitory computer-readable storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for training a neural network having a plurality of network parameters, wherein the neural network is used to select actions to be performed by an agent interacting with an environment to perform a task in an attempt to achieve a specified result, wherein the neural network is configured to receive an input observation characterizing a state of the environment and to process the input observation in accordance with the network parameters to generate a network output that comprises an action selection output that defines an action selection policy for selecting an action to be performed by the agent in response to the input observation, and wherein the operations comprise: receiving a current observation characterizing a current state of the environment; determining a target network output for the current observation by performing a look ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look ahead search is guided by the neural network in accordance with current values of the network parameters; selecting an action to be performed by the agent in response to the current observation using the target network output generated by performing the look ahead search; and storing, in an exploration history data store, the target network output in association with the current observation for use in updating the current values of the network parameters.
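For concreteness, the following Python sketch illustrates one plausible realization of the look ahead search recited in claims 1 and 7 to 9: a Monte Carlo tree search over a state tree in which the neural network supplies prior probabilities and leaf-node value estimates, noise is mixed into the priors at the root node as in claim 8, and the normalized visit counts at the root serve as the target network output. The `net` and `env` interfaces, the `Node` class, the PUCT-style selection rule, and all parameter values are illustrative assumptions, not part of the claimed subject matter.

```python
import math
import numpy as np

class Node:
    """A state-tree node: stores a prior, a visit count, and a value sum."""
    def __init__(self, prior):
        self.prior = prior        # prior probability from the neural network
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}        # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def ucb_score(parent, child, c_puct=1.5):
    # PUCT-style score: exploit the mean backed-up value, explore via the prior.
    u = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + u

def run_search(root_state, net, env, num_simulations=100,
               dirichlet_alpha=0.3, noise_frac=0.25):
    """Look ahead search guided by the network; returns a visit-count target."""
    priors, _ = net(root_state)
    root = Node(prior=1.0)
    # Claim 8: add noise to the prior probabilities at the root node.
    noise = np.random.dirichlet([dirichlet_alpha] * len(priors))
    for a, p in enumerate(priors):
        root.children[a] = Node((1 - noise_frac) * p + noise_frac * noise[a])

    for _ in range(num_simulations):
        node, state, path = root, root_state, [root]
        # Traverse from the root until a leaf of the state tree is reached.
        while node.children:
            action, node = max(node.children.items(),
                               key=lambda kv: ucb_score(path[-1], kv[1]))
            state = env.next_state(state, action)
            path.append(node)
        # Claim 9: evaluate the leaf node with the neural network.
        priors, value = net(state)
        if not env.is_terminal(state):
            for a, p in enumerate(priors):
                node.children[a] = Node(p)
        # Back up the value estimate along the search path
        # (a single-agent perspective is assumed here).
        for n in path:
            n.visit_count += 1
            n.value_sum += value

    visits = np.array([root.children[a].visit_count for a in sorted(root.children)])
    return visits / visits.sum()   # target action-selection distribution
```

The distribution returned by `run_search`, stored alongside the current observation, plays the role of the target network output written to the exploration history data store in claim 1; in two-player settings the backed-up value would additionally alternate sign along the search path.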
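Claims 10 and 11 recite a gradient-based update under an objective that is a weighted sum of a policy-matching term and a return-matching term. A minimal sketch of one such update is shown below, assuming a PyTorch-style network that maps a batch of observations to action logits and a scalar predicted return, and a hypothetical batch drawn from the exploration history store; the specific choice of cross-entropy and mean-squared error for the two difference terms is an assumption.

```python
import torch.nn.functional as F

def training_step(net, optimizer, batch, value_weight=1.0):
    """One gradient step on a weighted-sum objective in the style of claim 11.

    batch: (observations, target_policies, target_returns) sampled from the
    exploration history store; all names here are illustrative.
    """
    observations, target_policies, target_returns = batch
    logits, predicted_returns = net(observations)

    # (i) Difference between the target probability distribution (e.g., the
    # search visit counts) and the network's action selection output,
    # measured here as a cross-entropy.
    policy_loss = -(target_policies * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

    # (ii) Difference between the target return and the predicted expected
    # return output, measured here as a mean-squared error.
    value_loss = F.mse_loss(predicted_returns.squeeze(-1), target_returns)

    # Weighted sum of the two terms, as in claim 11.
    loss = policy_loss + value_weight * value_loss

    optimizer.zero_grad()
    loss.backward()   # gradient with respect to the network parameters (claim 10)
    optimizer.step()  # update to the current values of the network parameters
    return loss.item()
```

The claims require only some measure of difference for each term and some weighting between them; the single `value_weight` scalar used here is one common way to realize that weighting.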